1 Introduction
In this book we are going to explore Machine Learning (ML) for Credit Card Fraud Detection (CCFD).
The problem with using ML for CCFD is that ML models require huge data sets (sometimes millions of examples, and sometimes needing human labeling to be usable) before they have enough information to extract general patterns.
These general patterns can be used simply to describe how the data comes into existence or, much more excitingly, to predict outcomes, behaviors, and variables of future data points.
The questions such predictions answer can range from “what are the chances it will rain tomorrow given the weather today?” to “how likely am I to miss a lecture given I stayed up all night reading about Bayesian probability?”.
The problems that arise are twofold.
First, collecting credit card (CC) transaction data (tx data) is difficult, because it is all kept private by a handful of paranoid, delusional conglomerates.
These companies all keep their data internal, which makes research by the broader scientific community nearly impossible.
The only well-known publicly available data set is about ten years old as of writing (in 2022) and has a mere few hundred thousand entries, which is peanuts by today’s measure.
Companies like Google run neural networks (NNs) with billions of artificial neurons, over multiple terabytes of data (in real time, no less).
As groundbreaking as this original public data set was for pushing research forward, it represents only a sliver of consumer transaction history, meaning it is really incapable of taking the study of the problem to the next level.
Studying and perfectly understanding one cell is great, but being able to study millions of cells interacting together is a lot more useful to biology, wouldn’t you agree?
Not to mention we have another issue: fraudulent transactions are not some static occurrence. The methods and behaviors of fraudsters change with time.
To put it more bluntly, fraudsters are not committing fraud exactly the same way they were ten years ago!
New technologies have created new security measures (chip technology) and opened up new attack vectors. Most fraudulent activity now takes place in what are called Card Not Present (CNP) scenarios (think online shopping).
Much of this activity would have been considered outlier data, or was simply not possible, when the original data set was collected.
This leads us to our second major issue, commonly referred to as class imbalance.
All transactions in a given data set are classified as either genuine or fraudulent (in general a zero for genuine and a one for fraud).
Because almost all transactions are genuine (>99%), you end up with a massive imbalance in the amount of data in each class.
One can see how this easily causes problems for CCFD. A bad algorithm can easily get 99% accuracy on a given data set and still miss every single fraudulent transaction!
It learned nothing, was completely useless, and still got a 99% on the test (so many jokes I’m not writing here).
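To make the accuracy trap concrete, here is a minimal sketch in plain Python. The data set is entirely made up for illustration (1,000 transactions, 10 of them fraudulent, so a 1% fraud rate), and the "model" is a hypothetical one that labels every transaction as genuine. The helper names `accuracy` and `recall` are ours, not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of all transactions that were labeled correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of the actual fraud cases (label 1) that were caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical data set: 10 fraudulent (1) and 990 genuine (0) transactions.
labels = [1] * 10 + [0] * 990

# A "model" that has learned nothing: it calls every transaction genuine.
always_genuine = [0] * len(labels)

acc = accuracy(labels, always_genuine)
rec = recall(labels, always_genuine)
print(f"accuracy = {acc:.2%}, fraud recall = {rec:.2%}")
# prints "accuracy = 99.00%, fraud recall = 0.00%"
```

Accuracy alone hides the failure. Measuring recall on the fraud class, the share of fraudulent transactions actually caught, exposes it immediately.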
Of the regression algorithms in use today, there is one that stands out in its ability to tackle this specific set of problems: Logistic Regression (LR).
LR is incredibly important to predictive analysis due to its 3 main features:
predicting the probability that a response variable equals 1
categorizing outcomes
assessing the odds ratios associated with model predictions
These features individually are not unique to LR; however, logistic regression stands out in offering all 3 directly from a single fitted model.
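As a rough preview of those three features, the sketch below uses a one-variable logistic model in plain Python. Everything in it is an assumption made for illustration: the coefficients `INTERCEPT` and `COEF_AMOUNT` are invented rather than fitted to any data, the single input is a hypothetical standardized transaction amount, and the 0.5 threshold is just a common default.

```python
import math

# Hypothetical, hand-picked coefficients (not fitted to real data).
INTERCEPT = -4.0
COEF_AMOUNT = 0.8  # change in log-odds per unit of standardized amount


def sigmoid(z):
    """Map a log-odds value onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(amount_z):
    """Feature 1: probability that the response variable equals 1 (fraud)."""
    return sigmoid(INTERCEPT + COEF_AMOUNT * amount_z)


def classify(amount_z, threshold=0.5):
    """Feature 2: categorize the outcome by thresholding the probability."""
    return 1 if fraud_probability(amount_z) >= threshold else 0


# Feature 3: the odds ratio for a one-unit increase in the input.
# In logistic regression this is exp(coefficient), about 2.23 here.
odds_ratio = math.exp(COEF_AMOUNT)

for amount_z in (0.0, 5.0, 8.0):
    p = fraud_probability(amount_z)
    print(f"amount_z={amount_z}: p(fraud)={p:.3f}, class={classify(amount_z)}")
print(f"odds ratio per unit of amount_z: {odds_ratio:.2f}")
```

The same fitted coefficients drive all three outputs: the probability comes from the sigmoid, the category from a threshold on that probability, and the odds ratio from exponentiating a coefficient.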
Let’s continue our journey by unpacking what Logistic Regression is, and showing how it can help us in CCFD.